Feature Extraction and Efficiency Comparison Using Dimension Reduction Methods in Sentiment Analysis Context

Abstract:

Widespread access to the Internet, and to social networks in particular, lets users share their ideas and opinions. Analyzing people's feelings and ideas can, in turn, play a significant role in the decision making of organizations and producers. Hence, sentiment analysis, or opinion mining, is an important field in natural language processing (NLP). One of the most common approaches to such problems is machine learning, which builds a model that maps features to the desired output. A challenge of using machine learning in NLP is selecting and extracting features from a large set of initial features so that the resulting models are highly accurate. In fact, a large number of features not only causes computational and time problems but can also harm model accuracy. Studies show that different methods have been used for feature extraction or selection. Some of these methods select the most important features from the feature set, such as Principal Component Analysis (PCA) based methods. Others, such as neural networks, map the original features to new ones with fewer dimensions while preserving the same semantic relations. For example, neural network-based methods can convert sparse feature vectors into dense embedding vectors. Still others, such as NMF-based methods, cluster the feature set and extract a lower-dimensional feature set. In this paper, we compare the performance of three methods, one from each of these classes, across different dataset sizes. We use two compression methods, Singular Value Decomposition (SVD), which is based on selecting the more important attributes, and Non-negative Matrix Factorization (NMF), which is based on clustering the initial features, together with an auto-encoder-based method, which converts the initial features into a new feature set with the same semantic relations.
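
For orientation, the three compression methods named above are commonly written as follows. This is a sketch of the standard textbook formulations, not the paper's exact setup: the symbols X, k, Z, W, H, f and g are introduced here for illustration, and the abstract does not state the auto-encoder architecture or the loss functions the authors used.

```latex
% Assumed notation (not from the abstract): X is the n-by-d feature matrix
% (n samples, d original features) and k < d is the target feature count.

% Truncated SVD: keep the k largest singular values and their vectors.
\[
X \approx U_k \Sigma_k V_k^{\top}, \qquad
Z_{\mathrm{SVD}} = X V_k = U_k \Sigma_k \in \mathbb{R}^{n \times k}
\]

% NMF: factor X into two non-negative matrices; rows of W are the new features.
\[
\min_{W \ge 0,\; H \ge 0} \lVert X - W H \rVert_F^2, \qquad
W \in \mathbb{R}^{n \times k},\; H \in \mathbb{R}^{k \times d}, \qquad
Z_{\mathrm{NMF}} = W
\]

% Auto-encoder: encoder f compresses, decoder g reconstructs; f(x_i) is the new feature vector.
\[
\min_{f,\, g} \sum_{i=1}^{n} \lVert x_i - g(f(x_i)) \rVert^2, \qquad
z_i = f(x_i) \in \mathbb{R}^{k}
\]
```

In all three cases the classifier is then trained on the k-dimensional representation instead of the original d features.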
We compare how well these methods extract fewer, more effective features for a sentiment analysis task on a Persian dataset. We also evaluate the impact of the compression level and the dataset size on model accuracy. Studies show that compression not only reduces computational and time costs but can also increase the accuracy of the model. For the experimental analysis, we use the Sentipers dataset, which contains more than 19000 samples of user opinions about digital products; samples are represented as bag-of-words vectors. These feature vectors are very large because their size equals the vocabulary size. We set up our experiment with 4 sub-datasets of different sizes and show how each compression method performs at various compression levels (feature counts) depending on the dataset size. According to the results of classification with SVM, compressing the features from 7700 to 2000 with the neural network not only speeds up processing and reduces storage costs but also increases model accuracy from 77.05% to 77.85% on the largest dataset, which contains about 19000 samples. On the small dataset, the SVD approach gives better results: with 2000 of the 7700 original features it obtains 63.92% accuracy, compared to the initial accuracy of 63.57%. Furthermore, the results indicate that, on the large dataset with low-dimensional feature sets, neural network-based compression is much better than the other approaches: with only 100 features extracted by the neural network-based auto-encoder, the system achieves an acceptable 74.46% accuracy, against 67.15% for SVD, 64.09% for NMF, and 77.05% for the base model with 7700 features.
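
The statement that a bag-of-words vector is as long as the vocabulary can be illustrated with a minimal sketch. The toy reviews, the whitespace tokenizer, and the helper names `build_vocabulary` and `bag_of_words` are assumptions made for this example; they are not taken from the paper or from the Sentipers dataset.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a column index, in first-seen order."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(doc, vocab):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(doc.lower().split())
    vector = [0] * len(vocab)
    for token, n in counts.items():
        if token in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[token]] = n
    return vector


# Toy reviews standing in for user opinions (hypothetical, not from Sentipers).
reviews = [
    "good camera good battery",
    "bad battery",
    "good screen bad camera",
]

vocab = build_vocabulary(reviews)
vectors = [bag_of_words(review, vocab) for review in reviews]

# Every vector has exactly len(vocab) entries, and most of them are zero.
for review, vector in zip(reviews, vectors):
    print(f"{review!r:30} -> {vector}")
```

With a real corpus the vocabulary grows to thousands of words (about 7700 features in the experiments described above), which is why the vectors are compressed before classification.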

Similar resources

Dimension Reduction by Mutual Information Feature Extraction

During the past decades, researchers have proposed many feature extraction algorithms to study high-dimensional data in a large variety of problems. One of the most effective approaches for optimal feature extraction is based on mutual information (MI). However, it is not always easy to get an accurate estimate of high-dimensional MI. In terms of MI, the optimal feature extraction is creatin...

Feature Dimension Reduction of Multisensor Data Fusion using Principal Component Fuzzy Analysis

These days, much of the most important research across many applications and tools focuses on how to obtain awareness. One serious application is awareness of the behavior and activities of patients, which matters because individuals need ubiquitous medical care. It is sometimes very important that the doctor knows the patient's physical condition. O...

Sentiment analysis methods in Persian text: A survey

With the explosive growth of social media such as Twitter, reviews on e-commerce websites, and comments on news websites, individuals and organizations are increasingly using the opinions in these media for their decision making. Sentiment analysis is one of the techniques used in recent years to analyze users' opinions. The Persian language has specific features and thereby requires unique meth...

On the comparison of keyword and semantic-context methods of learning new vocabulary meaning

The rationale behind the present study is that particular learning strategies produce more effective results when applied together. The present study tried to investigate the efficiency of the semantic-context strategy along with a technique called the keyword method. To clarify the point, the current study sought to answer the following question: are the keyword and semantic-context metho...

Feature Selection Methods in Persian Sentiment Analysis

With the enormous growth of digital content on the Internet, various types of online reviews, such as product and movie reviews, present a wealth of subjective information that can be very helpful for potential users. Sentiment analysis aims to use automated tools to detect subjective information in reviews. As little research has so far been conducted on feature selection in sentiment analysis,...

Comparison of dimension reduction methods using patient satisfaction data

In this study, we compared classical principal components analysis (PCA), generalized principal components analysis (GPCA), linear principal components analysis using neural networks (PCA-NN), and non-linear principal components analysis using neural networks (NLPCA-NN). Data were extracted from the patient satisfaction query regarding patients' satisfaction with hospital staff, whic...

Volume 16, Issue 3

Pages 79-88

Publication date: 2019-12
